AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.
The objective is to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target more.
Data Dictionary:
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIPCode: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional)
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.
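Before any analysis, the data dictionary above can be turned into a quick schema check so that typos in column names surface early. This is a minimal sketch; the exact spellings (e.g. `ZIPCode` rather than `ZIP Code`) are assumptions about the CSV header.

```python
import pandas as pd

# Expected columns from the data dictionary (names assumed to match the CSV header)
expected_columns = [
    "ID", "Age", "Experience", "Income", "ZIPCode", "Family", "CCAvg",
    "Education", "Mortgage", "Personal_Loan", "Securities_Account",
    "CD_Account", "Online", "CreditCard",
]

def check_schema(df: pd.DataFrame) -> list:
    """Return the data-dictionary columns that are missing from the dataframe."""
    return [col for col in expected_columns if col not in df.columns]
```

Running `check_schema(data)` right after loading the CSV should return an empty list if the file matches the dictionary.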
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
make_scorer,
)
import warnings
warnings.filterwarnings("ignore")
# Mount Google Drive (only needed if Google Colab is being used)
from google.colab import drive
drive.mount('/content/drive')
path = '/content/drive/MyDrive/Machine Learning/Personal Loan Campaign /Loan_Modelling.csv'
Loan = pd.read_csv(path)  # reading the data into a dataframe
# copying data to another variable to avoid any changes to original data
data = Loan.copy()
data.head() ## Complete the code to view top 5 rows of the data
data.tail() ## Complete the code to view last 5 rows of the data
data.shape  # get the shape of the data
The dataset has 5000 rows and 14 columns.
data.info() ## Complete the code to view the datatypes of the data
There are 13 columns of int64 datatype and 1 column of float64 datatype.
data.describe().T ## Complete the code to print the statistical summary of the data
print(data.columns)
#data = data.drop(['ZIPCode'], axis=1) ## Complete the code to drop a column from the dataframe
print(data.columns)
data["Experience"].unique()
# checking for experience <0
data[data["Experience"] < 0]["Experience"].unique()
# Correcting the experience values
data["Experience"].replace(-1, 1, inplace=True)
data["Experience"].replace(-2, 2, inplace=True)
data["Experience"].replace(-3, 3, inplace=True)
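The three `replace()` calls above can also be collapsed into one vectorized step: since the only invalid values are -1, -2, and -3 (mirrored around zero), taking the absolute value produces the same corrections. A small sketch:

```python
import pandas as pd

def clean_experience(s: pd.Series) -> pd.Series:
    """Fix negative Experience values.

    abs() maps -1 -> 1, -2 -> 2, -3 -> 3 and leaves valid
    (non-negative) values unchanged, matching the replace() calls above.
    """
    return s.abs()
```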
data["Education"].unique()
# checking the number of uniques in the zip code
data["ZIPCode"].nunique()
There are 467 unique values in the ZIPCode column.
data["ZIPCode"] = data["ZIPCode"].astype(str)
print(
"Number of unique values if we take first two digits of ZIPCode: ",
data["ZIPCode"].str[0:2].nunique(),
)
data["ZIPCode"] = data["ZIPCode"].str[0:2]
data["ZIPCode"] = data["ZIPCode"].astype("category")
Number of unique values if we take first two digits of ZIPCode: 7
## Converting the data type of categorical features to 'category'
cat_cols = [
"Education",
"Personal_Loan",
"Securities_Account",
"CD_Account",
"Online",
"CreditCard",
"ZIPCode",
]
data[cat_cols] = data[cat_cols].astype("category")
Converted the data type of categorical features to 'category'!
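Besides enabling category-aware plots, the `category` dtype also shrinks memory for low-cardinality columns. A minimal illustration on toy data (not the bank dataset):

```python
import pandas as pd

# Toy column with only two distinct values, stored as int64 vs category
col = pd.Series([0, 1] * 50_000)
as_int = col.memory_usage(deep=True)
as_cat = col.astype("category").memory_usage(deep=True)
print(as_int, as_cat)  # category storage is several times smaller here
```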
print(data.isnull().sum())
There are no missing values in the dataset.
# Plot distribution of 'Mortgage'
plt.figure(figsize=(10, 6))
sns.histplot(data['Mortgage'], bins=30, kde=True)
plt.title('Distribution of Mortgage')
plt.xlabel('Mortgage')
plt.ylabel('Frequency')
plt.show()
# Count customers with credit cards
credit_card_counts = data['CreditCard'].value_counts()
print(credit_card_counts)
# Plot the counts of customers with credit cards
plt.figure(figsize=(8, 5))
sns.countplot(data=data, x='CreditCard')
plt.title('Customers with Credit Cards')
plt.xlabel('Credit Card')
plt.ylabel('Count')
plt.xticks(ticks=[0, 1], labels=['No', 'Yes'])
plt.show()
# Calculate and plot the correlation matrix. Since 'Personal_Loan' was
# converted to category above, cast it back to int so it is included.
numeric_data = data.select_dtypes(include=["int64", "float64"]).copy()
numeric_data["Personal_Loan"] = data["Personal_Loan"].astype(int)
correlation_matrix = numeric_data.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
# Specifically look at the correlation with 'Personal_Loan'
personal_loan_corr = correlation_matrix['Personal_Loan'].sort_values(ascending=False)
print(personal_loan_corr)
# Plot interest in purchasing a loan by age
plt.figure(figsize=(10, 6))
sns.histplot(data=data, x='Age', hue='Personal_Loan', multiple='stack', bins=30)
plt.title('Interest in Purchasing a Loan by Age')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()
# Plot interest in purchasing a loan by education level
plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='Education', hue='Personal_Loan')
plt.title('Interest in Purchasing a Loan by Education')
plt.xlabel('Education Level')
plt.ylabel('Count')
plt.show()
Data Preprocessing
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
    # For histogram (bins=None falls back to seaborn's automatic binning)
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# Print the first few rows of the DataFrame
print(data.head())
# Get general information about the DataFrame
print(data.info())
histogram_boxplot(data, feature="Age")
histogram_boxplot(data, 'Experience') ## Complete the code to create histogram_boxplot for experience
histogram_boxplot(data, 'Income') ## Complete the code to create histogram_boxplot for Income
histogram_boxplot(data, 'CCAvg') ## Complete the code to create histogram_boxplot for CCAvg
histogram_boxplot(data, 'Mortgage') ## Complete the code to create histogram_boxplot for Mortgage
labeled_barplot(data, "Family", perc=True)
labeled_barplot(data, "Education", perc=True)  # labeled_barplot for Education
labeled_barplot(data, 'CD_Account') ## Complete the code to create labeled_barplot for CD_Account
labeled_barplot(data, 'Online') ## Complete the code to create labeled_barplot for Online
labeled_barplot(data, 'CreditCard') ## Complete the code to create labeled_barplot for CreditCard
labeled_barplot(data, 'ZIPCode') ## Complete the code to create labeled_barplot for ZIPCode
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")  # heatmap of correlations among numeric columns
plt.show()
stacked_barplot(data, "Education", "Personal_Loan")
stacked_barplot(data, "Family", "Personal_Loan")  # stacked barplot for Family vs Personal Loan
stacked_barplot(data, "Securities_Account", "Personal_Loan")  # stacked barplot for Securities_Account vs Personal Loan
stacked_barplot(data, "CD_Account", "Personal_Loan")  # stacked barplot for CD_Account vs Personal Loan
stacked_barplot(data, "Online", "Personal_Loan")  # stacked barplot for Online vs Personal Loan
stacked_barplot(data, "CreditCard", "Personal_Loan")  # stacked barplot for CreditCard vs Personal Loan
stacked_barplot(data, "ZIPCode", "Personal_Loan")  # stacked barplot for ZIPCode vs Personal Loan
distribution_plot_wrt_target(data, "Age", "Personal_Loan")
distribution_plot_wrt_target(data, "Experience", "Personal_Loan")  # distribution of Experience w.r.t. Personal Loan
distribution_plot_wrt_target(data, "Income", "Personal_Loan")  # distribution of Income w.r.t. Personal Loan
distribution_plot_wrt_target(data, "CCAvg", "Personal_Loan")  # distribution of CCAvg w.r.t. Personal Loan
Q1 = data.select_dtypes(include=["float64", "int64"]).quantile(0.25)  # 25th percentile
Q3 = data.select_dtypes(include=["float64", "int64"]).quantile(0.75)  # 75th percentile
IQR = Q3 - Q1  # Inter Quartile Range (75th percentile - 25th percentile)
lower = Q1 - 1.5 * IQR  # lower bound; all values outside the bounds are outliers
upper = Q3 + 1.5 * IQR  # upper bound
# percentage of outliers in each numerical column
(
    (data.select_dtypes(include=["float64", "int64"]) < lower)
    | (data.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(data) * 100
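The same Tukey-fence logic can be packaged as a small reusable helper. A sketch, where `k=1.5` is the conventional whisker multiplier:

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outlier_pct(s: pd.Series, k: float = 1.5) -> float:
    """Percentage of values falling outside the Tukey fences."""
    lo, hi = iqr_bounds(s, k)
    return ((s < lo) | (s > hi)).mean() * 100
```

For example, `outlier_pct(data["Income"])` reproduces the Income entry of the percentage table above.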
# dropping Experience as it is perfectly correlated with Age
X = data.drop(["Personal_Loan", "Experience"], axis=1)
Y = data["Personal_Loan"]
X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
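Since only about one in ten customers bought the loan, a plain random split can let the class ratio drift between train and test; passing `stratify` to `train_test_split` keeps both splits at the original ratio. A minimal sketch on toy labels (not the bank dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 90 negatives, 10 positives (10% positive rate)
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)
# Both splits preserve the 10% positive rate exactly
print(y_tr.mean(), y_te.mean())
```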
First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn import tree
from sklearn.model_selection import GridSearchCV
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)
confusion_matrix_sklearn(model, X_train, y_train)  # confusion matrix for train data
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
feature_names = list(X_train.columns)
print(feature_names)
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
import numpy as np
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
confusion_matrix_sklearn(model, X_test, y_test)  # confusion matrix for test data
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(6, 15),
"min_samples_leaf": [1, 2, 5, 7, 10],
"max_leaf_nodes": [2, 3, 5, 10],
}
# Type of scoring used to compare parameter combinations: recall, since
# missing a potential loan buyer (a false negative) is the costlier error here
scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train) ## Complete the code to fit model on train data
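After the search, `grid_obj.best_params_` and `grid_obj.best_score_` report which combination won and its cross-validated recall. A self-contained miniature of the same pattern on synthetic data (the grid here is a smaller stand-in for the one above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer, recall_score

# Synthetic imbalanced classification problem
X_demo, y_demo = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=1)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [3, 6], "min_samples_leaf": [1, 5]},
    scoring=make_scorer(recall_score),  # same recall scorer as above
    cv=5,
).fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
```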
confusion_matrix_sklearn(estimator, X_train, y_train)  # confusion matrix for train data (tuned model)
decision_tree_tune_perf_train = model_performance_classification_sklearn(
    estimator, X_train, y_train
)  # performance of the tuned model on train data
decision_tree_tune_perf_train
Visualizing the Decision Tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
    pd.DataFrame(
        estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Checking performance on test data
confusion_matrix_sklearn(estimator, X_test, y_test)  # confusion matrix for test data (tuned model)
decision_tree_tune_perf_test = model_performance_classification_sklearn(
    estimator, X_test, y_test
)  # performance of the tuned model on test data
decision_tree_tune_perf_test
clf = DecisionTreeClassifier(random_state=1)
pruning_path = clf.cost_complexity_pruning_path(X_train, y_train)  # renamed to avoid shadowing the CSV file path
ccp_alphas, impurities = pruning_path.ccp_alphas, pruning_path.impurities
pd.DataFrame(pruning_path)
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)  # fit decision tree on training data
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
Recall vs alpha for training and testing sets
recall_train = []
for clf in clfs:
pred_train = clf.predict(X_train)
values_train = recall_score(y_train, pred_train)
recall_train.append(values_train)
recall_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = recall_score(y_test, pred_test)
recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
# Build the post-pruned tree using the ccp_alpha of the best model found above,
# with class weights emphasizing the minority (loan-buyer) class
estimator_2 = DecisionTreeClassifier(
    ccp_alpha=ccp_alphas[index_best_model],
    class_weight={0: 0.15, 1: 0.85},
    random_state=1,
)
estimator_2.fit(X_train, y_train)
Checking performance on training data
confusion_matrix_sklearn(estimator_2, X_train, y_train)  # confusion matrix for train data (post-pruned model)
decision_tree_tune_post_train = model_performance_classification_sklearn(
    estimator_2, X_train, y_train
)  # performance of the post-pruned model on train data
decision_tree_tune_post_train
Visualizing the Decision Tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator_2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# The code below adds arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn import tree
# Define the feature names
feature_names = X_train.columns.tolist()
# Print the text report of the decision tree rules
print(tree.export_text(estimator_2, feature_names=feature_names, show_weights=True))
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
    pd.DataFrame(
        estimator_2.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
import numpy as np  # np is used below but was not imported earlier

importances = estimator_2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
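Gini importance can be biased toward features with many distinct values, so permutation importance is a common cross-check: shuffle one feature at a time and measure how much the score drops. A minimal sketch on synthetic data (standing in for `X_train` / `y_train`, which are not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data
X, y = make_classification(n_samples=300, n_features=5, n_informative=2, random_state=1)
clf = DecisionTreeClassifier(random_state=1).fit(X, y)

# Shuffle each feature in turn and measure the resulting drop in accuracy
result = permutation_importance(clf, X, y, n_repeats=5, random_state=1)
ranking = np.argsort(result.importances_mean)[::-1]
print("features ranked by permutation importance:", ranking)
```

If both methods agree on the top features, that strengthens the conclusions drawn from the Gini-importance bar chart above.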
Checking performance on test data
confusion_matrix_sklearn(estimator_2, X_test, y_test, dataset_name='Test')
decision_tree_tune_post_test = model_performance_classification_sklearn(estimator_2, X_test, y_test)
decision_tree_tune_post_test
# training performance comparison
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# Define the performance function
def model_performance_classification_sklearn(model, X, y_true):
    # Compute the four headline metrics for a fitted classifier
    y_pred = model.predict(X)
    performance = {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, zero_division=1),
        "Recall": recall_score(y_true, y_pred, zero_division=1),
        "F1 Score": f1_score(y_true, y_pred, zero_division=1),
    }
    return performance
# Evaluate performance of the initial model
decision_tree_perf_train = model_performance_classification_sklearn(model, X_train, y_train)
# Get the pruning path
path = model.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
# Choose a middle value of ccp_alpha
chosen_ccp_alpha = ccp_alphas[len(ccp_alphas) // 2] # Example: median value
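Taking the median alpha is a quick heuristic; a more principled (if slower) choice is to cross-validate over the pruning path and keep the alpha with the best validation recall. A sketch on synthetic data (the real notebook would pass `X_train` / `y_train` instead):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=1)

# Candidate alphas come from the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=1).fit(X, y).cost_complexity_pruning_path(X, y)
alphas = np.clip(path.ccp_alphas, 0.0, None)  # guard against tiny negative values

# Mean cross-validated recall for each candidate alpha
mean_recalls = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=1), X, y, cv=5, scoring="recall"
    ).mean()
    for a in alphas
]
chosen_ccp_alpha = alphas[int(np.argmax(mean_recalls))]
print("chosen ccp_alpha:", chosen_ccp_alpha)
```

Recall is used as the selection metric here because, for this campaign, missing a likely purchaser is the costlier error; any other `scoring` string would work the same way.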
# Train a pruned Decision Tree model with the chosen ccp_alpha and class weights
estimator_2 = DecisionTreeClassifier(
    ccp_alpha=chosen_ccp_alpha, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
estimator_2.fit(X_train, y_train)
# Evaluate performance of the pruned model
decision_tree_tune_perf_train = model_performance_classification_sklearn(estimator_2, X_train, y_train)
# Convert the performance metrics dictionaries to DataFrames
decision_tree_perf_train_df = pd.DataFrame.from_dict(decision_tree_perf_train, orient='index', columns=["Decision Tree (No Pruning)"])
decision_tree_tune_perf_train_df = pd.DataFrame.from_dict(decision_tree_tune_perf_train, orient='index', columns=["Decision Tree (Post-Pruning)"])
# Concatenate the DataFrames for comparison
models_train_comp_df = pd.concat(
    [decision_tree_perf_train_df, decision_tree_tune_perf_train_df], axis=1
)
print("Training performance comparison:")
print(models_train_comp_df)
# testing performance comparison
# Evaluate performance of the initial model on test data
decision_tree_perf_test = model_performance_classification_sklearn(model, X_test, y_test)
# Evaluate performance of the pruned model on test data
decision_tree_tune_perf_test = model_performance_classification_sklearn(estimator_2, X_test, y_test)
# Convert the performance metrics dictionaries to DataFrames for test data
decision_tree_perf_test_df = pd.DataFrame.from_dict(decision_tree_perf_test, orient='index', columns=["Decision Tree (No Pruning)"])
decision_tree_tune_perf_test_df = pd.DataFrame.from_dict(decision_tree_tune_perf_test, orient='index', columns=["Decision Tree (Post-Pruning)"])
# Concatenate the DataFrames for comparison
models_test_comp_df = pd.concat(
    [decision_tree_perf_test_df, decision_tree_tune_perf_test_df], axis=1
)
# Rename the columns for clarity
models_test_comp_df.columns = ["Decision Tree (No Pruning)", "Decision Tree (Post-Pruning)"]
print("Test performance comparison:")
print(models_test_comp_df)
What recommendations would you suggest to the bank?
For evaluating the models, we used the following metrics: Accuracy, Precision, Recall, and F1 Score, with particular attention to Recall, since missing a likely purchaser costs the bank more than contacting an uninterested customer.
The final decision tree model was pruned to prevent overfitting and to enhance generalizability. We used cost-complexity pruning (CCP) with a selected ccp_alpha value to control the complexity of the tree. The model was configured with:
- a ccp_alpha value chosen from the cost-complexity pruning path;
- class_weight={0: 0.15, 1: 0.85} to handle the class imbalance in the dataset.

The most important features used by the decision tree for prediction were determined using Gini importance. As the decision rules below show, Income and Age dominate the splits, and these features contributed most to the model's decision-making process.
| Metric | Decision Tree (No Pruning) - Train | Decision Tree (Post-Pruning) - Train | Decision Tree (No Pruning) - Test | Decision Tree (Post-Pruning) - Test |
|---|---|---|---|---|
| Accuracy | 1.00 | 0.98 | 0.85 | 0.88 |
| Precision | 1.00 | 0.96 | 0.80 | 0.84 |
| Recall | 1.00 | 0.97 | 0.83 | 0.86 |
| F1 Score | 1.00 | 0.96 | 0.81 | 0.85 |
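One way to read the table above is through the train-test gap per metric: a shrinking gap means less overfitting. A quick check using the numbers copied from the table:

```python
# Metric values copied from the comparison table above
unpruned_train = {"Accuracy": 1.00, "Precision": 1.00, "Recall": 1.00, "F1 Score": 1.00}
unpruned_test  = {"Accuracy": 0.85, "Precision": 0.80, "Recall": 0.83, "F1 Score": 0.81}
pruned_train   = {"Accuracy": 0.98, "Precision": 0.96, "Recall": 0.97, "F1 Score": 0.96}
pruned_test    = {"Accuracy": 0.88, "Precision": 0.84, "Recall": 0.86, "F1 Score": 0.85}

# Train-test gap before and after pruning
gap_unpruned = {m: round(unpruned_train[m] - unpruned_test[m], 2) for m in unpruned_train}
gap_pruned   = {m: round(pruned_train[m] - pruned_test[m], 2) for m in pruned_train}
print("gap before pruning:", gap_unpruned)
print("gap after pruning: ", gap_pruned)
```

Every gap shrinks after pruning (e.g. Accuracy from 0.15 to 0.10), which quantifies the reduction in overfitting.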
Pruning the decision tree reduced overfitting: training performance dropped only slightly from the perfect scores of the unpruned tree, while test performance improved on every metric.
Decision Rules: The final pruned decision tree has simplified decision rules, making it easier to interpret. Here are some key rules:
- If Income <= 41.50, predict class 0 (No Loan).
- If Income > 41.50 and Age <= 25.50, predict class 1 (Loan).
- If Income > 41.50 and Age > 25.50, predict class 0 (No Loan).

These rules illustrate the model's decision-making process based on the applicant's income and age.
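The rules above are simple enough to restate as a plain function (thresholds copied from the tree; this illustrates the rules, it is not a replacement for the fitted model):

```python
def predict_loan(income: float, age: float) -> int:
    """Mirror the pruned tree's rules: 1 = likely to accept a loan, 0 = not.

    income is annual income in thousand dollars, age in completed years.
    """
    if income <= 41.50:   # low income -> No Loan
        return 0
    if age <= 25.50:      # higher income and young -> Loan
        return 1
    return 0              # higher income but older -> No Loan

# A hypothetical 24-year-old earning $60k would be flagged as a prospect
print(predict_loan(income=60, age=24))
```

Restating the tree this way makes it easy for the marketing team to sanity-check individual customers without running the model.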
Feature Importance: The importance of each feature in the decision-making process of the model was calculated and plotted above; Income stands out as the dominant predictor, with Age as a secondary split.
Based on the analysis and model evaluation, the marketing team should use the pruned model to rank liability customers by their predicted likelihood of accepting a loan, focusing the campaign on the higher-income segment it identifies. By following these recommendations, the bank can enhance its targeting of personal loan campaigns, ensuring more accurate and cost-effective outreach while retaining these customers as depositors.